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Abstract 

Background: In a natural population, the alleles of nnultiple tightly linked loci on the sanne chromosome co-segregate 
and are passed non-randomly from generation to generation. Capitalizing on this phenomenon, a group of mapping 
methods, commonly referred to as the linkage disequilibrium-based mapping (LD mapping), have been developed 
recently for detecting genetic associations. However, most current LD mapping methods mainly employed 
single-marker analysis, overlooking the rich information contained within adjacent linked loci. 

Results: We extend the single-marker LD mapping to include two linked loci and explicitly incorporate their LD 
information into genetic mapping models (tmLD). We establish the theoretical foundations for the tmLD mapping 
method and also provide a thorough examination of its statistical properties. Our simulation studies demonstrate that 
the tmLD mapping method significantly improves the detection power of association compared to the single-marker 
based and also haplotype based mapping methods. The practical usage and properties of the tmLD mapping method 
were further elucidated through the analysis of a large-scale dental caries GWAS data set. It shows that the tmLD 
mapping method can identify significant SNPs that are missed by the traditional single-marker association analysis and 
haplotype based mapping method. An R package for our proposed method has been developed and is freely available. 

Conclusions: The proposed tmLD mapping method is more powerful than single marker mapping generally used 
in GWAS data analysis. We recommend the usage of this improved method over the traditional single marker 
association analysis. 
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Background 

Most economically, biologically and clinically important 
traits, such as those linked to poplar growth, cancer de- 
velopment and dental caries risk, are inherently complex 
in terms of their polygenic control and sensitivity to the 
environment [1]. The number of genes involved in these 
traits is typically large, each exerting a small effect and 
acting singly or interactively with others in a complicated 
network. For this reason, the genetic analysis of complex 
traits has been very difficult. However, a profound under- 
standing of the genetic control mechanisms of complex 
traits is crucial to economy and life. Therefore, the devel- 
opment of more powerful and complex genetic mapping 
methods has become increasingly urgent. 
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In recent years, with the advancement of new DNA- 
based biotechnologies, such as single-nucleotide poly- 
morphism (SNP) arrays, genome-wide association studies 
(GWAS) have become feasible to dissect the phenotypic 
variation of a complex trait into individual genetic compo- 
nents. Particularly, SNP arrays have gained popularity due 
to their cost-effectiveness: in year 2011 alone, 1068 GWAS 
were performed, each with at least 100,000 SNPs geno- 
typed (www.genome.gov/gwastudies). Based on the most 
recent summary data of dbSNP database (www.ncbi.nlm. 
nih.gov/projects/SNP), there are - $38 million (about 1 
percent of the total genome) of validated SNPs in human 
genome. However, even the densest SNP array on the 
market can only accommodate -1 million SNPs, and 
hence a great percentage of SNPs is not able to be sam- 
pled in a real genetic study. Fortunately, SNPs in the gen- 
ome are not independent from each other, i.e. they are 
locally connected and form the so-called linkage disequi- 
librium (LD) blocks. Because of this unique correlation 
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structure, the sampled genetic markers carry partial infor- 
mation about the unsampled SNPs and may be used for 
genomewide association analyses. 

LD is a phenomenon arising from the co-inheritance 
of alleles at nearby loci on the same chromosome, and is 
defined as the deviation of the observed frequency of a 
haplotype from random association [2]. Historically, LD 
analysis was developed to quantify the genetic structure 
and the diversity of natural populations [3-5]. Many efforts 
have been put into developing dense maps of molecular 
markers for a wide variety of species. For example, LD 
structures have been estimated in human [6] as well as 
Holstein cattle [7], sheep [8] and dog [9]. With some re- 
gularity conditions [2], it can be shown that a LD value 
between any two loci decays with generations at the re- 
combination rate between them: 

= (i-r)/)W (1) 

where D^^^^^ is the LD value at generation t-\-l and r is 
the recombination rate between the two loci. Therefore, 
the LD value approaches to zero gradually at a geomet- 
ric rate of 1-r. The larger the r, the faster the rate of 
convergence. According to Equation (1), if a significant 
D^'^^^ value can be detected in the current generation, it 
impUes r must be very small, almost close to 0, under 
the assumption that the initial LD was generated long 
time ago {i.e. t is large). This assumption is plausible be- 
cause it does take a long time for mutations/LD to be 
spread in a population. Therefore, the principle of link- 
age disequilibrium decaying with generation builds up 
an alternative mapping strategy [10,11], which provides 
an important tool for the fine mapping of genes affect- 
ing a quantitative trait. 

The LD mapping based on a single marker has been 
greatly studied [12-14]. However, little effort has been put 
on the LD mapping with multiple markers. Motivated by 
the seminal work of interval mapping proposed by Lander 
and Botstein in 1989 [15], in which genetic mapping was 
performed based on two neighboring genetic markers in 
controlled experiments, we propose to develop a new LD 
mapping framework that utilizes two SNP markers in a 
natural population. The new model explicitly incorporates 
the LD information between two markers into the map- 
ping analysis, and thus we expect the analysis based on 
two markers is more powerful than that based on a single 
marker in a natural population just as Lander and Botstein 
have discovered in the controlled experiment. In the fol- 
lowing sections, we first laid out the modeling framework 
for the two-marker LD mapping (tmLD), with details on 
parameter estimation and hypothesis testing. We then fur- 
ther elucidated our method through extensive simulation 
studies. Finally, we applied our method to a GWAS dental 
caries data set, followed by some discussions. 



Methods 

Two-marker LD (tmLD) mapping 

In the tmLD mapping framework, we assume a dichot- 
omous quantitative trait locus (QTL, Q) of alleles Q and 
q that is causal but unobserved, and the allele frequen- 
cies of Q and q are expressed as p2 and 1-/^2- Suppose 
that this QTL is genetically associated with two geno- 
typed SNP markers, and of two alleles Mi and mi, 
and M2 and ^2, with corresponding frequencies of pi and 
I'Pi, and p^ and l-p^, respectively. Further suppose the 
three linked SNPs in a tandem order, Q and tM2 at loci 
1, 2 and 3, and the recombination rates between and 
Q, between Q and ^2^ and between and ^2 are ri2, 
r23 and ris, respectively. The three SNPs form 8 possible 
haplotypes: M1QM2 (111), MiQm2 (110), MiqM2 (101), 
Miqm2 (100), miQM2 (Oil), miQm2 (010), miqM2 (001), 
miqm2 (000). To describe the linkage disequilibrium 
among them, their frequencies can be represented as fol- 
lows using four trigenic disequilibria parameters D12, D23, 
Dis and D123 (Additional file 1): 

=M(1-P1)^-''W(1-P2)'"V3'(1-P3)'"' +^//)^ 

(2) 

and D,j, = \ [{-lp\D,2 + {-lt'^D2, + (-l)'''"'' A3- 
(-1)''+^+^'"^'Di23] where ijj< = 0,h ^12,^23,^3 have 
exactly the same meaning as those in digenic disequilib- 
ria models for loci at positions 1/2, 2/3 and 1/3; and 
D123 is an additional trigenic disequilibria parameter for 
three loci together. Model (1) implies that /)i2) ^23^^13 
all geometrically decay with generations. It can be shown 
that with some reasonable assumptions, the D123 decreases 
with generations at a rate of (l-ris) and therefore also 
changes very slowly with time (Additional file 2). Hence, 
significant D12, D23, and D123 at current generation imply 
ri2and r23 are very small, which form the basis for LD 
mapping using two genetic markers. 

Likelihood function 

Suppose there is a random sample of size n drawn from 
a natural human population at Hardy- Weinberg equilib- 
rium. In this sample, multiple polymorphic sites, e.g. single 
nucleotide polymorphism (SNP), are genotyped, aiming at 
the identification of QTL affecting a continuous trait. The 
relationship between the observed phenotypic values and 
their expected means, determined by QTL genotypes, can 
then be described by the following model, 

yi = Y^%^^j^j^^i^ / = i,...,fz (3) 

Where yi is the phenotypic values for subject /, is 
an indicator variable defined as 1 if subject /, which con- 
tains markers {^a, ^n), has a QTL genotype ; (/ = 2 for 



Yang et a I. BMC Genetics 2014, 15:20 
http://www.bionnedcentral.conn/1471-21 56/1 5/20 



Page 3 of 9 



QQ, 1 for Qq and 0 for qq) and 0 otherwise, f^j is the ex- 
pected phenotypic value for QTL genotype and is the 
error term reflecting the polygenic effects of other unlinked 
genes and the environmental effect, which can be assumed 
to follow A^(0, o^) if y is continuous. The conditional prob- 
ability of subject / with its given markers carrying a certain 
QTL genotype;, ^;|,=p(q=; 1^,1,^,2) P(^iJ = can be cal- 
culated from Table 1. Therefore, the likelihood of the quan- 
titative trait (y) and molecular markers (^^i, for one 
putative QTL (Q) and can be constructed by a mixture 
model: 

n 2 
i=l ;=0 

where is a vector of the population genetic parame- 
ters {pi>P2>P3>Di2,D23)Dis,Di23) that is used to describe 
frequencies of haplotypes formed by markers and QTL 
and subsequently nj\i s, is a vector of the quantitative 
genetic parameters that define genotype-specific traits, 
which contains (i^pj =1, 2, 3, and cr) for a continuous trait 
that is assumed to be normally distributed, and fj{') is the 
probability density function for QTL genotype 

The likelihood function provides a model for obtaining 
the maximum likelihood estimates of the unknown param- 
eters {dp, n^), which can be achieved by differentiating 



Table 1 Joint zygote probabilities of the QTL genotypes at QTL Q and two-marker genotypes at markers M1 and M2, 
as expressed in terms of zygote configurations in a natural population 

Marker Joint marker-QTL genotype frequency 



Genotype 




Frequency 


qq (0) 


Qq (1) 


QQ (2) 




(00) 


Poo 


Pooo 

(nooo) 


2poioPooo 
(Hoio) 


Poio 
(no2o) 




(01) 


2poiPoo 


2pooiPooo 
(Hooi) 


2poi 1P000 + 2poioPooi 
(Hon) 


2poioPoi 1 

(n02l) 




(02) 


Poi 


Pool 

(^002) 


2poi 1P001 

(^01 2) 


Poll 

(no22) 




(10) 


2pooPio 


2piooPooo 

(Hi 00) 


2p 1 1 oPooo + 2p 1 ooPo 1 0 
(niio) 


2pi 10P010 

(ni20) 




(11) 


2piiPoo 
+ 2pioPoi 


2pioiPooo + 2piooPooi 

(^lOl) 


2pi 1 1P000 + 2pi 10P001 
+ 2pioiPoio + 2piooPoi 1 

(Hi 11) 


2piiiPoio + 2piioPoi 
(^121) 




(12) 


2piiPoi 


2pioiPooi 


2piiiPooi +2pioiPoii 
(^112) 


2piiiPoii 
(^122) 




(20) 


Pio 


P^oo 

(^200) 


2pi 10P100 
(^210) 


PllO 

(^220) 




(21) 


2piiPio 


2pioiPioo 

(/^20l) 


2piiiPioo + 2piioPioi 
(^211) 


2piioPiii 
(^221) 




(22) 




P101 

("202) 


2piiiPioi 

(^21 2) 


Pill 

(n222) 



the log-likelihood with respect to each unknown param- 
eter, setting the derivatives equal to zero and then solving 
the equations. The log-likelihood function of the pheno- 
typic values is given by 

n r 2 

1= iog[i(a^,a,|y, ^1,^2)] = ^ log Y,^j\jj{y,\a,) 

i=l L y=o 

Computational algorithms 

Within the maximum likelihood estimation framework, 
an efficient EM algorithm can be implemented to obtain 
the MLEs of (H^, H^), and is summarized into the fol- 
lowing steps: 

Step 1. Give initial values for the unknown parameters 

Step 2. E step - Calculate the posterior probabilities for 
each subject / to carry a particular QTL genotype 7 

1 • T-r ^i\if i{yi\^ci) 

usmg the equation \Aj\i = 2 / 1 

Step 3. M step - Solve the log-likelihood equations for 
each parameter based on observed data and rTyi^ to 
obtain its estimate. To estimate the quantitative 
genetic parameters (H^), their expressions in closed 
forms can be derived based on the estimation 
equations. For the estimates of the population genetic 
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parameters (H^), another inner layer of EM algorithm 
can be employed. 
Step 4. Repeat the E and M steps until the estimates 
converge to stable values. The estimates at 
convergence are the MLEs of parameters. 

The detailed derivation for the EM algorithm is given 
in Additional file 3. 

Hypothesis testing 

In general, the hypothesis testing of QTL mapping in- 
cludes two steps: (1) the existence of QTL and (2) their 
locations. The focus of this study is on the second step, 
assuming that sufficient evidences for the existence of 
QTL have been collected to enable a large-scale geno- 
typing study. Then the hypotheses for the tmLD method 
can be formulated as follows: 

Ho : The QTL is not associated with two SNP markers, 
i.e. Du=D23=Du3=0: Hi : Not/fo 

The estimates of the parameters under the null hy- 
potheses can be obtained with the same EM algorithm 
derived for the alternative hypotheses, but with a constraint 
that all subjects have the same posterior probability. A like- 
lihood ratio test (LRT) statistics can be constructed and 
computed to draw the inference about whether a QTL 
may be associated with given markers. Under the Hq, the 
LRT statistics asymptotically follows a /^-distribution with 
three degrees of freedom. 

Results 

Simulation settings 

Extensive Monte Carlo simulation experiments were per- 
formed to examine the statistical properties of the proposed 
tmLD mapping method. Since in a genome-wide scan, a 
QTL must be located between some pair of markers, in the 
experimental design of simulations, we considered two sce- 
narios as illustrated in Figure 1: (1) the QTL is assumed to 
be unobserved, but it is in LD with two adjacent SNPs; and 
(2) the QTL is assumed to be one of the genetic markers 
and therefore genotyped. 



QTL 
QTL k) 

SNP1 SNP2 

Figure 1 Two simulation settings. (1) QTL is unobserved but in 
linloge equilibrium witli two adjacent SNPs. (2) QTL is observed as 
one of tine SNP marl<ers. 

v J 



Let us randomly choose a sample of n subjects from a 
human population at Hardy- Weinberg equilibrium. In this 
population, one QTL is segregating and is inferred by a pair 
of markers. The allele frequencies of the markers and 
^2) and QTL (Q) and their linkage disequilibria values are 
given as follows: pi = 0.5 for allele Mi of ^i; p2 = 0.5 for 
allele Q of Q; ps = 0.5 for allele M2 of The LD pa- 
rameters among the markers and QTL loci are given as: 
D12 = 0.05, Dis = 0.15, D23 = 0.05 and D123 = 0.04. For sub- 
jects who carry QTL genotype their phenotypic values 
were simulated based on Model (3), with ^2 = 10? f^i = 5, 
1^0 = 0. The variances in phenotypic values were calculated 
based on different heritability values (//^). H^ quantifies 
the genetic contribution from the QTL to the overall trait 
and H^ = 0 implies that the means for three QTL genotype 
groups are the same, which are set to be 0. With the above 
given parameters and design, we simulated the phenotypic 
and marker information by assuming different sample 
sizes (A^=100, 250, 500, 1000, 1500, 2000, 2500, 3000), 
and different heritability values {H^ = 0, 0.05, 0.1, 0.2, 0.3, 
0.4). Each simulation setting is carried out 1000 times for 
the evaluation of power and type I error. 

Type I error evaluation and power comparison 

Simulated data were used to compare our proposed tmLD 
method with single-marker based association analyses, in- 
cluding the single-marker LD mapping method (smLD) 
and single-marker based association test (smAT), and two- 
marker based haplotype analysis (haplo). The smLD was 
performed as described in Additional file 4. The smAT is a 
simple linear regression model with phenotypic trait as re- 
sponse variable and marker genotypes as categorical inde- 
pendent variable. The haplotype analysis was conducted as 
described in [16]; briefly, the haplotype that yields the best 
model fitting among those formed by two markers is used 
in comparison with tmLD. 

Under the simulation scenario 1, where the QTL is in 
LD phase with both markers, the results suggest that the 
association analysis based on two markers is significantly 
higher than the single- marker based and also haplotype 
based methods. Figure 2 shows that as the heritabflity 
increases, the power of each method increases corres- 
pondingly as expected. When H^ = 0, which suggests no 
QTL effects, aU methods maintained the nominal type I 
error (0.05); when H^ ^ 0, the two-marker association 
performed consistently better than others, and as ex- 
pected, the power increased with the sample size. 

Under the simulation scenario 2, where the QTL is set to 
be the marker 1, the most powerfiil test is the single marker 
association method using marker 1, and the power of the 
single marker association based on marker 2 is significantly 
lower (Figure 3). However, the tmLD analysis is almost as 
powerful as the optimal test, particularly when the sample 
size is reasonably large (N > 1000). This demonstrates that 
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H2 = 0.2 H2 = 0.3 H2 = 0.4 




0 500 1500 2500 0 500 1500 2500 0 500 1500 2500 



Sannple Size Sannple Size Sannple Size 

Figure 2 Power comparison when QTL is in linlcage disequalibria with both marlcer 1 and marlcer 2. The power curves were constructed 
under different heritability (H^). smAT_ml and smAT_m2 denote the single-marker association analyses for marker 1 and marker 2, respectively; 
smLD_ml and smLD_m2 denote single-marker LD mapping using marker 1 and marker 2, respectively; and hapio is for the two-marker based 
haplotype analysis. 



even when the QTL is indeed sampled in a genomic study, 
our proposed model is as good as the optimal test. These 
simulation results demonstrate the power advantage and 
robustness of our proposed method comparing with exist- 
ing methods based on single marker. Its practical usage was 
further elucidated in a real GWAS data set. 

Real data example 

Dental caries or cavities, more commonly known as 
tooth decay, is one of the most common chronic disor- 
ders in humans, affecting approximately 40% children 
and adolescents and 90% adults in the US. The etiology 
and pathogenesis of dental caries have been determined 
to be multifactorial, such as environmental factors re- 
lated to social behaviors [17]. However, it is also appar- 
ent that some individuals are very susceptible to caries 
while some others are more resistant, almost irrelevant 
to the environmental risk factors they are exposed to. 



suggesting that genetic factors may play prominent roles 
in the caries development. Supported by evidence in 
both human and animal studies [18-21], the caries herit- 
ability has been estimated to be between 30-60%. The 
most compelling evidence come from the twin studies 
that the significant resemblance of dental caries lies 
within monozygotic but not dizygotic twin pairs [22,23]. 
So it is without question that in addition to environmen- 
tal factors, genetic components also profoundly influ- 
ence the dental caries trait. To understand the genetic 
mechanisms of the dental caries, a GWAS study has 
been conducted and the dataset has been deposited in 
dbGaP (Study Accession: phs000095.v2.pl). Here we will 
apply our proposed model to analyze this caries GWAS 
dataset, in which 1843 adults were genotyped with a 
large panel of SNPs (610,000). We carried out the ana- 
lysis using the caries outcomes that have been well de- 
fined in other GWAS studies, i.e. the DIMFT index 
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Figure 3 Power comparison when QTL is at the exact position of the marlcer 1. The power curves were constructed under different 
heritability (H^). The tmLD model performs almost identically with the true model even when the QTL is the marker 1. smAT_ml and smAT_m2 
denote the single-marker association analyses for marker 1 and marker 2, respectively; smLD_ml and smLD_m2 denote single-marker LD mapping 
using marker 1 and marker 2, respectively; and hapIo is for the two-marker based haplotype analysis. 



which quantifies the total permanent tooth caries with 
white spots. 

smAT, smLD, haplo and tmLD association methods 
were appUed to the data. After removing SNPs that do 
not satisf)^ HWE (p-values < 10'^) and also SNPs with 
minor allele frequency less than 0.1, the number of SNPs 
that were included in the analysis is 443,175. To com- 
pare the performance of all methods, we plotted out the 
association signals at each SNP locus. Figures 4 and 5 
show the Manhattan plots of the -loglO(p-values) from 
smAT and tmLD methods, respectively, and the dashed red 
line corresponds to the genome-wide Bonferroni threshold 
(l.lE-7). SNPs that passed this threshold are considered to 
be significant and were tabulated in Table 2. For the haplo 
and smLD methods, since no significant SNP was identified 
by these two methods, their Manhattan plots were not 
shown. Particularly, the tmLD model identified two signifi- 
cant genes, CNTN5 and COL4A2, which have been shown 



from other studies to be associated with dental related 
phenotypes in other studies [24], validating the findings of 
our model biologically. None of the other three methods 
(smAT, smLD or haplo) found these two genes. The smAT 
identified another significant locus. However, gene anno- 
tation shows that it is not related to any known genes, so 
its biological implication remains unclear. 

Discussion 

It is well recognized that naturally occurring variations in 
most complex disease traits have a genetic basis and conse- 
quently many GWAS studies have been conducted in the 
past few years. In analyzing these data, a phenomenon, 
called "missing heritability", has been observed that the de- 
tected genetic variants can explain only a small portion of 
the heritability of phenotypic traits while a majority part re- 
mains mysterious [25]. Part of the reason may be attributed 
to the lack of power in current methods. Thus, developing 
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novel and powerful methods to better detect significant 
genes has been of great interest. Currently the routine 
GWAS analyses seek single-marker association between 
SNPs and phenotype, and when a significant association is 
detected, it implies that there might be some SNP(s) in 
linkage that are causal Note that it cannot imply the test 
SNP itself is causal because there is no guarantee that the 
truly causal SNPs would have been genotyped. Since the 



interpretation of a significant association relies on the link- 
age concept, it is sensible to directly incorporate the LD in- 
formation into association models. Additionally, due to the 
structure of LD blocks, a causal SNP is usually in linkage 
with multiple neighboring SNPs, all of which carry partial 
information about it. So in this sense, a new model that 
can incorporates more genetic information of linked SNPs 
should draw better inferences about the causal SNP. 
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Table 2 List of significant SNPs with p-value < 1.1e-7 in the Caries dataset 

SNP ID Gene Chr Coordinate Allele MAP Pj^at PsmLo Phapio PtmLo 

rs7607421 - 2 220500564 CfT 0.390 3.2E-08* 2.1E-04 2.0E-06 6.9E-05 

rsl 0790497 CNTN5 11 98539071 A/G 0.346 8.8E-01 8.2E-01 1.7E-03 2.6E-08* 

rs7319311 COL4A2 13 109828579 A/G 0.326 5.8E-02 2.7E-02 2.8E-02 1.0E-07* 

PsmAT, PsmLD, Phapio, PtmLD: P values for Corresponding methods. ^Significant SNPs identified by smAT. *Significant SNPs identified by tmLD. 



In this article, we proposed a novel statistical method 
by considering two SNPs simultaneously. Our model is 
built upon the general LD mapping framework, and ex- 
tends the previous methods based on single-marker 
LD. The simulation studies demonstrated that our new 
methods dramatically improved the detection power of 
the underlying QTLs. This is intuitively reasonable since 
our model can capture the linkage information between 
SNP markers, and hence has more power to detect the 
particular QTL that are in LD with both markers. Further- 
more, the simulation studies indicated that even when the 
underlying QTL is indeed genotyped and is one of the 
markers, the performance of the tmLD analysis is nearly 
identical to that of the optimal test resulting from the 
causal SNP, suggesting the robustness of our model. 

We applied our model to a GWAS date set that aimed 
to understand the genetic mechanisms of the dental car- 
ies. The data set contains a large cohort of 1,843 subjects 
as well as a very large number of SNPs (443,175). This 
shows that both our proposed method and the corre- 
sponding software package in R can be well applied to a 
typical GWAS data set. In addition, we also observed 
that the association analyses based on the single-marker 
and the two-marker models yielded different profiles of 
significant SNPs. This is somewhat expected since their 
assumptions are different. For the tmLD method, we as- 
sume that both markers must obey HWE and have to be 
in LD with the casual SNP. It might be possible that 
some SNPs would violate these assumptions and become 
unsuitable to the tmLD. In this sense, the single and 
two-marker analyses may be complementary to each other, 
and therefore it might be beneficial to use both methods in 
analyzing a real data set. 

Sometimes population structure may be a concern in a 
GWAS analysis if subpopulations indeed exist in the 
sample, as it may lead to spurious associations. Several 
well-known methods developed to account for population 
structure [26] can be incorporated into our LD mapping 
framework to address this issue. For instance, the principal 
component analysis (PCA) can be applied to correct for 
stratifications [27] . That is, we may first apply PCA on the 
genotype data and then choose the first few large principal 
components to be included in the Model (3) as additional 
covariates. With slight modifications, the computation al- 
gorithms and hypothesis testing described in the Method 
section can be readily applied. 



In this work, we generalized the single marker LD ana- 
lysis to a more general LD mapping framework using 
two adjacent markers. There are several ongoing works 
worthy of further investigation. First, the model can be 
easily extended to other types of phenotypic data, such 
as case-control binary and count data. Second, currently 
the two adjacent markers were used for the analysis; 
however, it is possible that another two markers in the 
same LD block might have better power, so it would be 
very interesting to determine how to choose the best SNP 
pair. Third, typically, one LD block may contain several 
SNPs, and if there exists one causal SNP within the LD 
block, it would be very interesting to see if we can 
summarize all SNPs in one LD block to make even better 
inference about the unobserved QTL. 

Conclusions 

The proposed tmLD model is a novel mapping method 
that can simultaneously consider two linked SNPs in a nat- 
ural population. Through the extensive simulation studies, 
the tmLD method demonstrates better power than single- 
marker mapping strategies traditionally used in GWAS as- 
sociation analysis. The practical usage of the tmLD method 
was also shown in the analysis of a large-scale dental 
GWAS dataset. Hence, we recommend the usage of this 
improved method over the traditional single-marker asso- 
ciation analysis. 

Software availability 

http://www.ams.sunysb.edu/~songwu/software.html. 
Additional files 



Additional file 1: Representation of three-loci haplotypes with four 
LD parameters. 

Additional file 2: Derivation of how Dt23 may change with time. 

Additional file 3: Derivation of the EM algorithm used to find MLEs 
for a mixture model. 

Additional file 4: Single-marker based LD mapping. 
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